speech encoder
SENSE models: an open source solution for multilingual and multimodal semantic-based tasks
Mdhaffar, Salima, Elleuch, Haroun, Chellaf, Chaimae, Nguyen, Ha, Estève, Yannick
This paper introduces SENSE (Shared Embedding for N-lingual Speech and tExt), an open-source solution inspired by the SAMU-XLSR framework and conceptually similar to Meta AI's SONAR models. These approaches rely on a teacher-student framework to align a self-supervised speech encoder with the language-agnostic continuous representations of a text encoder at the utterance level. We describe how the original SAMU-XLSR method has been updated by selecting a stronger teacher text model and a better initial speech encoder. The source code for training and using SENSE models has been integrated into the SpeechBrain toolkit, and the first SENSE model we trained has been publicly released. We report experimental results on multilingual and multimodal semantic tasks, where our SENSE model achieves highly competitive performance. Finally, this study offers new insights into how semantics are captured in such semantically aligned speech encoders.

Speech foundation models based on self-supervised learning (SSL) have brought significant advances in speech processing. These models, such as wav2vec 2.0 [1], HuBERT [2], and WavLM [3], generate learned speech representations that can be applied to a wide range of downstream speech processing tasks. By training on large amounts of unlabelled speech data, SSL models have demonstrated the ability to capture crucial speech features, such as phonemes and other acoustic units [4]. This capability has led to significant progress in multiple downstream tasks, including speech recognition [1], speech translation [5], speech separation, speaker verification, speaker diarization [3], and emotion detection [6]. Different approaches have been proposed to pretrain models by aligning speech and text, such as mSLAM [7], a massively multilingual joint pre-training approach for speech and text.
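The utterance-level teacher-student alignment mentioned in the abstract can be illustrated with a small sketch. The mean pooling, the cosine-distance objective, and all function names here are assumptions made for illustration; the abstract does not specify the pooling or loss that SENSE uses.

```python
import math

def mean_pool(frames):
    """Average frame-level vectors into one utterance-level vector (assumed pooling)."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def alignment_loss(student_frames, teacher_embedding):
    """Distance between the pooled student (speech) output and the
    frozen teacher (text) embedding of the same utterance."""
    return cosine_distance(mean_pool(student_frames), teacher_embedding)

# Toy values: three 2-dimensional "speech frames" and one "text embedding".
student_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher_embedding = [2.0, 2.0]
loss = alignment_loss(student_frames, teacher_embedding)
```

In this toy case the pooled speech vector points in the same direction as the text embedding, so the loss is approximately zero; during training, only the student speech encoder would be updated to reduce this distance.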
Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment
Gao, Yan, Yang, Yazheng, Lan, Zhibin, Chen, Yidong, Zhang, Min, Wei, Daimeng, Huang, Hui, Su, Jinsong
Code-switching (CS) speech translation (ST) refers to translating speech that alternates between two or more languages into a target language text, which poses significant challenges due to the complexity of semantic modeling and the scarcity of CS data. Previous studies tend to rely on the model itself to implicitly learn semantic modeling during training, and resort to inefficient and costly manual annotations for these two challenges. To mitigate these limitations, we propose enhancing Large Language Models (LLMs) with a Mixture of Experts (MoE) speech projector, where each expert specializes in the semantic subspace of a specific language, enabling fine-grained modeling of speech features. Additionally, we introduce a multi-stage training paradigm that utilizes readily available monolingual automatic speech recognition (ASR) and monolingual ST data, facilitating speech-text alignment and improving translation capabilities. During training, we leverage a combination of language-specific loss and intra-group load balancing loss to guide the MoE speech projector in efficiently allocating tokens to the appropriate experts, across expert groups and within each group, respectively. To bridge the data gap across different training stages and improve adaptation to the CS scenario, we further employ a transition loss, enabling smooth transitions of data between stages, to effectively address the scarcity of high-quality CS speech translation data. Extensive experiments on widely used datasets demonstrate the effectiveness and generality of our approach.
Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages
Omnilingual ASR team, Keren, Gil, Kozhevnikov, Artyom, Meng, Yen, Ropers, Christophe, Setzler, Matthew, Wang, Skyler, Adebara, Ife, Auli, Michael, Balioglu, Can, Chan, Kevin, Cheng, Chierh, Chuang, Joe, Droof, Caley, Duppenthaler, Mark, Duquenne, Paul-Ambroise, Erben, Alexander, Gao, Cynthia, Gonzalez, Gabriel Mejia, Lyu, Kehan, Miglani, Sagar, Pratap, Vineel, Sadagopan, Kaushik Ram, Saleem, Safiyyah, Turkatenko, Arina, Ventayol-Boada, Albert, Yong, Zheng-Xin, Chung, Yu-An, Maillard, Jean, Moritz, Rashel, Mourachko, Alexandre, Williamson, Mary, Yates, Shireen
Automatic speech recognition (ASR) has advanced in high-resource languages, but most of the world's 7,000+ languages remain unsupported, leaving thousands of long-tail languages behind. Expanding ASR coverage has been costly and limited by architectures that restrict language support, making extension inaccessible to most, all while entangled with ethical concerns when pursued without community collaboration. To transcend these limitations, we introduce Omnilingual ASR, the first large-scale ASR system designed for extensibility. Omnilingual ASR enables communities to introduce unserved languages with only a handful of data samples. It scales self-supervised pre-training to 7B parameters to learn robust speech representations and introduces an encoder-decoder architecture designed for zero-shot generalization, leveraging an LLM-inspired decoder. This capability is grounded in a massive and diverse training corpus; by combining breadth of coverage with linguistic variety, the model learns representations robust enough to adapt to unseen languages. Incorporating public resources with community-sourced recordings gathered through compensated local partnerships, Omnilingual ASR expands coverage to over 1,600 languages, the largest such effort to date, including over 500 never before served by ASR. Automatic evaluations show substantial gains over prior systems, especially in low-resource conditions, and strong generalization. We release Omnilingual ASR as a family of models, from 300M variants for low-power devices to 7B for maximum accuracy. We reflect on the ethical considerations shaping this design and conclude by discussing its societal impact. In particular, we highlight how open-sourcing models and tools can lower barriers for researchers and communities, inviting new forms of participation. Open-source artifacts are available at https://github.com/facebookresearch/omnilingual-asr.
Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation
Feng, Pengchao, Ma, Ziyang, Chen, Wenxi, Li, Yao, Wang, Sheng, Yu, Kai, Chen, Xie
End-to-end speech-to-speech (S2S) dialogue systems have recently garnered increasing research attention for their lower latency and more natural integration of nonverbal cues such as emotion and speaker identity. However, these systems face key challenges, particularly in incorporating external knowledge, a capability commonly addressed by Retrieval-Augmented Generation (RAG) in text-based large language models (LLMs). The core difficulty lies in the modality gap between input speech and retrieved textual knowledge, which hinders effective integration of information. To address this issue, we propose a novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries. Experimental results demonstrate that our method significantly improves the performance of end-to-end S2S dialogue systems while achieving higher retrieval efficiency. Although the overall performance still lags behind the SOTA cascaded models, our framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems. Our code and dataset are released.
WEST: LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction
Zhang, Binbin, Liang, Chengdong, Wang, Shuai, Geng, Xuelong, Guo, Zhao, Li, Haoyu, Yin, Hao, Yang, Xipeng, Zhang, Pengshen, Ma, Changwei, Xie, Lei
In this paper, we present WEST (WE Speech Toolkit), a speech toolkit based on a large language model (LLM) for speech understanding, generation, and interaction. There are three key features of WEST: 1) Fully LLM-based: Standing on the shoulders of giants by reusing mature architectures, ecosystems (e.g., Hugging Face), and methods (e.g., sequence packing) from large models. 2) Full-stack: Supports tasks such as recognition, synthesis, understanding, dialogue, and multimodal capabilities, with extensibility to incorporate open-source models. 3) Simple and Stupid: A simple and stupid speech toolkit that everyone can Touch. In addition, WEST provides two types of recipes, models, and experimental results. The first is entirely based on open-source models and open-source data, allowing users to fully reproduce the experiments in this paper and serving as a verification system or minimal system baseline. The second is trained on massive data, offering superior performance so the user can directly apply it out of the box. WEST is publicly available at https://github.com/wenet-e2e/west/
Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations
Ghannam, Ahmad, Alharthi, Naif, Alasmary, Faris, Tabash, Kholood Al, Sadah, Shouq, Ghouti, Lahouari
In this work, we tackle the Diacritic Restoration (DR) task for Arabic dialectal sentences using a multimodal approach that combines both textual and speech information. We propose a model that represents the text modality using an encoder extracted from our own pre-trained model named CATT. The speech component is handled by the encoder module of the OpenAI Whisper base model. Our solution is designed following two integration strategies. The former consists of fusing the speech tokens with the input at an early stage, where the 1500 frames of the audio segment are averaged over 10 consecutive frames, resulting in 150 speech tokens. To ensure embedding compatibility, these averaged tokens are processed through a linear projection layer prior to merging them with the text tokens. Contextual encoding is guaranteed by the CATT encoder module. The latter strategy relies on cross-attention, where text and speech embeddings are fused. The cross-attention output is then fed to the CATT classification head for token-level diacritic prediction. To further improve model robustness, we randomly deactivate the speech input during training, allowing the model to perform well with or without speech. Our experiments show that the proposed approach achieves a word error rate (WER) of 0.25 and a character error rate (CER) of 0.9 on the development set. On the test set, our model achieved WER and CER scores of 0.55 and 0.13, respectively.
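The early-fusion step described in the abstract (1500 audio frames averaged over groups of 10 consecutive frames, giving 150 speech tokens that then pass through a linear projection) can be sketched as follows. The 2-dimensional toy features, the weight values, and the function names are assumptions for illustration only; the actual encoder and projection dimensions are not given in the abstract.

```python
def average_frames(frames, group_size=10):
    """Average each run of `group_size` consecutive frames into one token."""
    if len(frames) % group_size != 0:
        raise ValueError("number of frames must be a multiple of group_size")
    dim = len(frames[0])
    pooled = []
    for start in range(0, len(frames), group_size):
        group = frames[start:start + group_size]
        pooled.append([sum(f[d] for f in group) / group_size for d in range(dim)])
    return pooled

def linear_project(tokens, weight, bias):
    """Apply y = W x + b to every token; `weight` has shape (out_dim, in_dim)."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]

# Toy stand-in for encoder output: 1500 frames with 2 features each.
frames = [[float(i), 1.0] for i in range(1500)]
speech_tokens = average_frames(frames, group_size=10)

# Hypothetical projection from 2 to 3 dimensions to match a text embedding size.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
projected = linear_project(speech_tokens, weight, bias)
```

The projected speech tokens would then be merged with the text tokens before the text encoder; how that merge is done is not detailed in the abstract, so it is left out here.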
Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models
Ògúnrèmí, Tolúlopé, Manning, Christopher D., Jurafsky, Dan, Livescu, Karen
Spoken language models (SLMs) that integrate speech with large language models (LMs) rely on modality adapters (MAs) to map the output of speech encoders to a representation that is understandable to the decoder LM. Yet we know very little about how these crucial MAs transform representations. Here we examine the MA output representation in three SLMs (SALMONN, Qwen2-Audio and Phi-4-Multimodal-Instruct). By finding the nearest decoder LM token to an MA representation, we uncover two strategies for MA representations. For models using a Whisper encoder, MAs appear to represent the meaning of the input using an English-based interlingua, allowing them to handle languages unseen in instruction tuning. For models that don't, like Phi-4-Multimodal-Instruct, MAs instead represent the phonetics of the input, but expressed with English words. We hypothesise that which strategy arises depends on whether the speech encoder is trained only for speech recognition or also for translation.
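The probing idea of finding the nearest decoder LM token to an MA representation can be illustrated with a toy nearest-neighbour lookup. The use of cosine similarity, the three-word vocabulary, and every vector below are assumptions for illustration; the abstract does not state which distance measure the authors use.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_token(vector, embedding_table):
    """Return the token whose embedding is most similar to `vector`."""
    return max(
        embedding_table,
        key=lambda tok: cosine_similarity(vector, embedding_table[tok]),
    )

# Hypothetical decoder LM input embeddings (3 tokens, 3 dimensions).
toy_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "house": [0.0, 0.0, 1.0],
}

# Hypothetical adapter outputs, one vector per speech position.
adapter_outputs = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.8]]
decoded = [nearest_token(v, toy_embeddings) for v in adapter_outputs]
```

Reading the resulting token sequence is what lets one judge whether the adapter output resembles a translation of the input or a phonetic rendering in English words.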
The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach
Ghazal, Nizar El, Caubrière, Antoine, Vielzeuf, Valentin
This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs. We systematically evaluate traditional multimodal context (combining text history and spoken current turn), full spoken history, and compressed spoken history approaches. Our experiments on the SpokenWOZ corpus demonstrate that providing the full spoken conversation as input yields the highest performance among models of similar size, significantly surpassing prior methods. Furthermore, we show that attention-pooling-based compression of the spoken history offers a strong trade-off, maintaining competitive accuracy with reduced context size. Detailed analysis confirms that improvements stem from more effective context utilization.
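As a rough illustration of attention-pooling-based compression, the sketch below collapses the frame vectors of one spoken turn into a single vector using softmax weights from a query vector. The single-query dot-product form, the toy values, and the function name are assumptions; the abstract does not describe the exact pooling design.

```python
import math

def attention_pool(frames, query):
    """Compress a sequence of frame vectors into one vector.

    Each frame is scored by its dot product with `query`; the scores are
    turned into weights with a softmax, and the weighted sum is returned.
    """
    scores = [sum(q * x for q, x in zip(query, f)) for f in frames]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights

# Toy frames for one past dialogue turn; in practice the query would be learned.
turn_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]  # a zero query gives uniform weights, i.e. plain averaging
pooled, weights = attention_pool(turn_frames, query)
```

Replacing each past turn with one or a few pooled vectors is what shortens the context passed to the Speech-LLM, at the cost of discarding frame-level detail.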